Transport Layer(传输层)

Transport services(运输层服务)

Transport services and protocols

  • provide logical communication(逻辑通信) between app processes running on different hosts

  • transport protocols run in end systems

    • send side: breaks app messages into segments, passes to network layer
    • rcv side: reassembles segments(报文段) into messages, passes to app layer
  • more than one transport protocol available to apps

    • Internet: TCP and UDP

Transport vs. network layer(运输层和网络层的关系)

  • network layer: logical communication between hosts

  • transport layer: logical communication between processes

    • relies on, enhances, network layer services

e.g.:
12 kids in Ann’s house sending letters to 12 kids in Bill’s house

  • hosts = houses
  • processes = kids
  • app messages = letters in envelopes
  • transport protocol = Ann and Bill who demux to in-house siblings
  • network-layer protocol = postal service

Internet transport-layer protocols(因特网运输层协议)

  • reliable, in-order delivery (TCP)
    • congestion control
    • flow control
    • connection setup
  • unreliable, unordered delivery: UDP

    • no-frills extension of “best-effort” IP(尽力而为)
  • services not available:

    • delay guarantees
    • bandwidth guarantees

multiplexing and demultiplexing(多路复用与多路分解)

multiplexing at sender: handle data from multiple sockets, add transport header (later used for demultiplexing)

demultiplexing at receiver: use header info to deliver received segments to correct socket

How demultiplexing works

host receives IP datagrams

  • each datagram has source IP address, destination IP address
  • each datagram carries one transport-layer segment
  • each segment has source, destination port number

TCP_UDP_segment_format.PNG

host uses IP addresses & port numbers to direct segment to appropriate socket

Connectionless demultiplexing(无连接的多路分解)

  • created socket has host-local port #:

    DatagramSocket mySocket1 = new DatagramSocket(12534);
  • when creating datagram to send into UDP socket, must specify

    • destination IP address
    • destination port #
  • when host receives UDP segment:

    • checks destination port # in segment
    • directs UDP segment to socket with that port #

IP datagrams with same dest. port #, but different source IP addresses and/or source port numbers will be directed to same socket at dest
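A minimal Java sketch of connectionless demultiplexing (the class and buffer names are illustrative, not from the source): one DatagramSocket bound to port 12534 receives datagrams from any source IP/port, since only the destination port selects the socket.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;

// Minimal sketch: a UDP receiver bound to local port 12534.
// Datagrams from ANY source IP/port that target this destination port are
// delivered to this one socket; the source address is only visible in the
// received packet, it plays no role in choosing the socket.
public class UdpDemuxReceiver {
    public static void main(String[] args) throws Exception {
        DatagramSocket mySocket1 = new DatagramSocket(12534); // host-local port #
        byte[] buf = new byte[2048];
        while (true) {
            DatagramPacket pkt = new DatagramPacket(buf, buf.length);
            mySocket1.receive(pkt); // blocks until a segment with dest port 12534 arrives
            System.out.println("from " + pkt.getAddress() + ":" + pkt.getPort()
                    + " -> " + new String(pkt.getData(), 0, pkt.getLength()));
        }
    }
}
```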

Connection-oriented demux(面向连接的多路分解)

TCP socket identified by 4-tuple:

  • source IP address
  • source port number
  • dest IP address
  • dest port number

demux: receiver uses all four values to direct segment to appropriate socket

server host may support many simultaneous TCP sockets:

  • each socket identified by its own 4-tuple

  • web servers have different sockets for each connecting client

    • non-persistent HTTP will have different socket for each request
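A minimal Java sketch of connection-oriented demultiplexing (the port number 6789 and class name are illustrative): every accept() yields a separate connection socket, and arriving segments are steered to it by the full 4-tuple.

```java
import java.net.ServerSocket;
import java.net.Socket;

// Minimal sketch: a TCP server listening on port 6789.
// Every accept() returns a NEW connection socket; segments are demultiplexed
// to that socket by the full 4-tuple (src IP, src port, dest IP, dest port),
// so many clients can connect to the same server port simultaneously.
public class TcpDemuxServer {
    public static void main(String[] args) throws Exception {
        ServerSocket welcomeSocket = new ServerSocket(6789);
        while (true) {
            Socket connectionSocket = welcomeSocket.accept(); // one socket per client
            System.out.println("new connection: "
                    + connectionSocket.getInetAddress() + ":" + connectionSocket.getPort()
                    + " -> local port " + connectionSocket.getLocalPort());
            connectionSocket.close(); // a real server would hand this off to a worker
        }
    }
}
```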

connectionless transport: UDP(无连接运输: UDP)

UDP: User Datagram Protocol [RFC 768]

  • finer application-level control over what data is sent, and when
  • no connection establishment needed
  • no connection state
  • small packet header overhead

  • “no frills,” “bare bones” Internet transport protocol

  • “best effort” service, UDP segments may be:
    • lost
    • delivered out-of-order to app
  • connectionless:

    • no handshaking between UDP sender, receiver
    • each UDP segment handled independently of others
  • UDP use:

    • streaming multimedia apps (loss tolerant, rate sensitive)
    • DNS
    • SNMP
  • reliable transfer over UDP:

    • add reliability at application layer
    • application-specific error recovery

UDP: segment header(UDP报文段首部)

UDP_segment_format.PNG

length: in bytes of UDP segment, including header

  • why is there a UDP?
    • no connection establishment (which can add delay)
    • simple: no connection state at sender, receiver
    • small header size
    • no congestion control: UDP can blast away as fast as desired

UDP checksum(UDP检验和)

  • end-end principle(端到端原则)

  • Goal: detect “errors” (e.g., flipped bits) in transmitted segment

  • sender:

    • treat segment contents, including header fields, as sequence of 16-bit integers
    • checksum: addition (one’s complement sum) of segment contents
    • sender puts checksum value into UDP checksum field
  • receiver:

    • compute checksum of received segment
    • check if computed checksum equals checksum field value:
      • NO - error detected
      • YES - no error detected. But maybe errors nonetheless? More later ….

e.g.: add two 16-bit integers

UDP_check.PNG

Note: when adding numbers, a carryout from the most significant bit needs to be added to the result

UDP Pseudo-Header(UDP伪首部)

UDP_pseudo_header.PNG

  • Protocol – 17 (UDP)

e.g. Checksum calculation of a simple UDP user datagram

UDP_checkSum_calcute.png

  • All 0s: padding so the data is a multiple of 16 bits (an extra zero byte is added if the data length is odd)
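A small Java sketch of the one's-complement checksum procedure above, run over a few made-up 16-bit words (in a real UDP checksum the words would cover the pseudo-header, the UDP header with the checksum field set to 0, and the padded data):

```java
// Sketch of the one's-complement checksum described above.
// Words are treated as unsigned 16-bit integers; any carry out of the
// high bit is wrapped around and added back in, and the final sum is inverted.
public class UdpChecksum {
    static int onesComplementChecksum(int[] words16) {
        int sum = 0;
        for (int w : words16) {
            sum += (w & 0xFFFF);
            if ((sum & 0x10000) != 0) {      // carry out of bit 15?
                sum = (sum & 0xFFFF) + 1;    // wrap it around
            }
        }
        return ~sum & 0xFFFF;                // one's complement of the sum
    }

    public static void main(String[] args) {
        // made-up 16-bit words standing in for pseudo-header + UDP segment
        int[] words = {0x4500, 0x0030, 0x4422, 0x0011};
        int checksum = onesComplementChecksum(words);
        System.out.printf("checksum = 0x%04X%n", checksum);
        // receiver check: one's-complement sum of all words plus the checksum is 0xFFFF
    }
}
```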

principles of reliable data transfer(可靠数据传输原理)

important in application, transport, link layers: on the top-10 list of important networking topics

  • characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

  • rdt: reliable data transfer

reliable_data_transfer.png

  • rdt_send(): called from above (e.g., by app.). Passes data to be delivered to the receiver's upper layer

  • udt_send(): called by rdt, to transfer packet over unreliable channel to receiver

  • rdt_rcv(): called when packet arrives on rcv-side of channel

  • deliver_data(): called by rdt to deliver data to upper layer

rdt1.0: reliable transfer over a reliable channel(经完全可靠信道的可靠数据传输)

  • incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)

  • consider only unidirectional data transfer

    • but control info will flow in both directions!
  • use finite state machines (FSM) to specify sender, receiver

  • underlying channel perfectly reliable

    • no bit errors
    • no loss of packets
  • separate FSMs for sender, receiver:

    • sender sends data into underlying channel
    • receiver reads data from underlying channel

rdt1.0.PNG

rdt2.0: channel with bit errors(经具有比特差错信道的可靠数据传输)

  • underlying channel may flip bits in packet
    • checksum to detect bit errors
  • the question: how to recover from errors:

    • acknowledgements(ACKs, 肯定确认): receiver explicitly tells sender that pkt received OK
    • negative acknowledgements(NAKs, 否定确认): receiver explicitly tells sender that pkt had errors
    • sender retransmits(重传) pkt on receipt of NAK
  • Automatic Repeat reQuest (ARQ, 自动重传请求) protocols

  • new mechanisms in rdt2.0 (beyond rdt1.0):

    • error detection(差错检测)
    • receiver feedback(接收方反馈): control msgs (ACK,NAK) rcvr->sender
  • stop-and-wait(停等) protocol

rdt2.0.PNG

rdt2.0 has a fatal flaw

  • what happens if ACK/NAK corrupted
    • sender doesn’t know what happened at receiver
    • can’t just retransmit: possible duplicate
  • duplicate packet(冗余分组)

  • handling duplicates:

    • sender retransmits current pkt if ACK/NAK corrupted
    • sender adds sequence number to each pkt
    • receiver discards (doesn’t deliver up) duplicate pkt
  • sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled(含糊不清的) ACK/NAKs

rdt2.1.PNG

  • sender:
    • seq # added to pkt
    • two seq. #’s (0,1) will suffice.
    • must check if received ACK/NAK corrupted
    • twice as many states
      • state must “remember” whether “expected” pkt should have seq # of 0 or 1
  • receiver:

    • must check if received packet is duplicate
    • state indicates whether 0 or 1 is expected pkt seq #
    • note: receiver can not know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

  • same functionality as rdt2.1, using ACKs only
  • instead of NAK, receiver sends ACK for last pkt received OK
    • receiver must explicitly include seq # of pkt being ACKed
  • duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2.PNG

rdt3.0: channels with errors and loss(经具有比特差错的丢包信道的可靠数据传输)

new assumption: underlying channel can also lose packets (data, ACKs)

  • checksum, seq. #, ACKs, retransmissions will be of help … but not enough

approach: sender waits “reasonable” amount of time for ACK

  • retransmits if no ACK received in this time
  • if pkt (or ACK) just delayed (not lost):
    • retransmission will be duplicate, but seq. #'s already handle this
    • receiver must specify seq # of pkt being ACKed
  • requires countdown timer

rdt3.0_sender.png

rdt3.0 in action

rdt3.0_no_loss.png

rdt3.0_packet_loss.png

rdt3.0_ack_loss.png

rdt3.0_premature_timeout_delayed_ack.png

Performance of rdt3.0

rdt3.0: stop-and-wait operation(停等)

rdt3.0_stop_and_wait.png

  • rdt3.0 is correct, but performance stinks
  • e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  • $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^9\ \text{bits/sec}} = 8\ \text{microsecs}$
  • RTT = 30ms

  • $U_{sender}$: utilization – fraction of time sender busy sending

  • $ U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027 $
  • 33kB/sec thruput over 1 Gbps link
  • network protocol limits use of physical resources

Pipelined protocols(流水线可靠数据传输协议)

  • pipelining: sender allows multiple, “in-flight”, yet-to-be-acknowledged pkts
    • range of sequence numbers must be increased
    • buffering at sender and/or receiver
  • two generic forms of pipelined protocols: go-Back-N, selective repeat

pipeline_increasd_utilization.png

  • 3-packet pipelining increases utilization(利用率) by a factor of 3
  • $ U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} = 0.0008 $
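A quick Java calculation reproducing the utilization figures above (1 Gbps link, 15 ms one-way propagation delay, 8000-bit packets are the stated assumptions):

```java
// Reproduces the utilization numbers above: stop-and-wait vs. 3-packet pipelining
// on a 1 Gbps link with 15 ms one-way propagation delay and 8000-bit packets.
public class Utilization {
    public static void main(String[] args) {
        double L = 8000;          // packet size, bits
        double R = 1e9;           // link rate, bits/sec
        double rtt = 0.030;       // round-trip time, seconds (2 x 15 ms)
        double dTrans = L / R;    // transmission delay = 8 microseconds

        double uStopAndWait = dTrans / (rtt + dTrans);
        double uPipelined3  = 3 * dTrans / (rtt + dTrans);

        System.out.printf("U (stop-and-wait) = %.5f%n", uStopAndWait);  // ~0.00027
        System.out.printf("U (3 in flight)   = %.5f%n", uPipelined3);   // ~0.00080
    }
}
```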

  • Go-back-N(GBN, 回退N步):

    • sender can have up to N unacked packets in pipeline
    • receiver only sends cumulative ack
      • doesn’t ack packet if there’s a gap
    • sender has timer for oldest unacked packet
      • when timer expires, retransmit all unacked packets
  • Selective Repeat(SR, 选择重传):

    • sender can have up to N unack’ed packets in pipeline
    • receiver sends individual ack for each packet
    • sender maintains timer for each unacked packet
      • when timer expires, retransmit only that unacked packet

Go-Back-N

Sender

  • k-bit seq # in pkt header
  • “window” of up to N, consecutive unack’ed pkts allowed
  • ACK(n): ACKs all pkts up to, including seq # n - “cumulative ACK”
    • may receive duplicate ACKs (see receiver)
  • timer for oldest in-flight pkt
  • timeout(n): retransmit packet n and all higher seq # pkts in window

goback_N.png

  • window size N(窗口长度)
  • sliding-window protocol(滑动窗口协议)
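A hedged Java sketch of the GBN sender bookkeeping just described; the class, the packet buffer, and the timer stubs are illustrative, not an actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the GBN sender rules above (transmission, timer and checksum are
// stubbed out; all names are illustrative).
public class GbnSender {
    private final int N;                 // window size
    private int base = 0;                // oldest unACKed seq #
    private int nextSeqNum = 0;          // next seq # to use
    private final List<String> sentPkts = new ArrayList<>(); // pkt buffer, indexed by seq #

    GbnSender(int windowSize) { this.N = windowSize; }

    // rdt_send(): accept data only if the window is not full
    boolean send(String data) {
        if (nextSeqNum >= base + N) return false;   // window full, refuse data
        sentPkts.add(data);
        udtSend(nextSeqNum);                        // send pkt nextSeqNum
        if (base == nextSeqNum) startTimer();       // timer runs for oldest unACKed pkt
        nextSeqNum++;
        return true;
    }

    // cumulative ACK(n): everything up to and including n is acknowledged
    void onAck(int n) {
        base = n + 1;
        if (base == nextSeqNum) stopTimer();        // nothing left in flight
        else startTimer();                          // restart for the new oldest pkt
    }

    // timeout: go back N – retransmit every pkt from base up to nextSeqNum-1
    void onTimeout() {
        startTimer();
        for (int seq = base; seq < nextSeqNum; seq++) udtSend(seq);
    }

    private void udtSend(int seq) { System.out.println("send pkt " + seq + ": " + sentPkts.get(seq)); }
    private void startTimer() { /* start/restart countdown timer (stub) */ }
    private void stopTimer()  { /* stop timer (stub) */ }
}
```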

sender extended FSM

GBN_sender_FSM.png

receiver extended FSM

GBN_receiver_FSM.png

  • ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
    • may generate duplicate ACKs
    • need only remember expectedseqnum
  • out-of-order pkt:
    • discard (don’t buffer): no receiver buffering
    • re-ACK pkt with highest in-order seq #

GBN in action

GBN_in_action.png

  • cumulative acknowledgment(累积确认)

Selective repeat

  • receiver individually acknowledges all correctly received pkts
    • buffers pkts, as needed, for eventual in-order delivery to upper layer
  • sender only resends pkts for which ACK not received
    • sender timer for each unACKed pkt
  • sender window
    • N consecutive seq #’s
    • limits seq #s of sent, unACKed pkts

sender, receiver windows:

selective_repeat_windows.png

  • sender
    • data from above:
      • if next available seq # in window, send pkt
    • timeout(n):
      • resend pkt n, restart timer
    • ACK(n) in [sendbase,sendbase+N]:
      • mark pkt n as received
      • if n smallest unACKed pkt, advance window base to next unACKed seq #
  • receiver

    • pkt n in [rcvbase, rcvbase+N-1]
      • send ACK(n)
      • out-of-order: buffer
      • in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
    • pkt n in [rcvbase-N,rcvbase-1]
      • ACK(n)
    • otherwise: ignore
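A hedged Java sketch of the SR receiver rules above (buffer out-of-order packets, ACK everything in or just below the window, deliver in order); names and the delivery stub are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the SR receiver rules listed above: individually ACK each packet
// in the window, buffer out-of-order arrivals, and deliver in order once the
// gap at rcvBase is filled.
public class SrReceiver {
    private final int N;                         // window size
    private int rcvBase = 0;                     // smallest not-yet-delivered seq #
    private final Map<Integer, String> buffer = new HashMap<>();

    SrReceiver(int windowSize) { this.N = windowSize; }

    // returns the seq # to ACK, or -1 if the packet is ignored
    int onReceive(int seq, String data) {
        if (seq >= rcvBase && seq < rcvBase + N) {        // inside [rcvBase, rcvBase+N-1]
            buffer.putIfAbsent(seq, data);
            while (buffer.containsKey(rcvBase)) {         // deliver any in-order run
                deliver(buffer.remove(rcvBase));
                rcvBase++;                                // advance window
            }
            return seq;                                   // send ACK(seq)
        } else if (seq >= rcvBase - N && seq < rcvBase) { // already delivered:
            return seq;                                   // re-ACK so the sender can advance
        }
        return -1;                                        // otherwise: ignore
    }

    private void deliver(String data) { System.out.println("deliver: " + data); }
}
```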

Selective repeat in action

selective_repeat_in_sction.png

Selective repeat: dilemma

  • example:
  • seq #’s: 0, 1, 2, 3
  • window size=3

selective_repeat_dilemma.png

  • receiver sees no difference in two scenarios, duplicate data accepted as new in (b)

  • Q: what relationship between seq # size and window size to avoid problem in (b)?

  • A: for SR, the window size must be less than or equal to half the size of the sequence number space

connection-oriented transport: TCP(面向连接的传输: TCP)

  • point-to-point: one sender, one receiver
  • reliable, in-order byte stream: no “message boundaries”
  • pipelined: TCP congestion and flow control set window size
  • full duplex data:
    • bi-directional data flow in same connection
    • MSS: maximum segment size
  • connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
  • flow controlled: sender will not overwhelm receiver

  • stream(流): no notion of message boundaries

  • Maximum Segment Size(MSS, 最大报文段长度)

  • Maximum Transmission Unit(MTU, 最大传输单元): maximum link-layer frame size

TCP segment structure(TCP报文段结构)

TCP_segment_structure.png

  • sequence numbers(序号字段): byte stream “number” of first byte in segment’s data
  • acknowledgements(确认号字段):
    • seq # of next byte expected from other side
    • cumulative ACK
  • Q: how receiver handles out-of-order segments
  • A: TCP spec doesn’t say – up to implementor

  • receive window field(接收窗口): used for flow control; indicates the number of bytes the receiver is willing to accept

TCP_seq_ack_number.png

TCP_telnet.png

  • Q: how to set TCP timeout value?
  • longer than RTT but RTT varies
  • too short: premature timeout, unnecessary retransmissions
  • too long: slow reaction to segment loss

  • Q: how to estimate RTT(估计往返时间)?

  • SampleRTT: measured time from segment transmission until ACK receipt
    • ignore retransmissions
  • SampleRTT will vary, want estimated RTT “smoother”
    • average several recent measurements, not just current SampleRTT
  • $ EstimatedRTT = (1-\alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT $
  • exponential weighted moving average(EWMA, 指数加权移动平均)
  • influence of past sample decreases exponentially fast
  • typical value: $\alpha = 0.125$
  • timeout interval: EstimatedRTT plus “safety margin”
  • large variation in EstimatedRTT -> larger safety margin

  • estimate SampleRTT deviation from EstimatedRTT:
    $$
    DevRTT = (1-\beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|
    $$
    (typically, $\beta = 0.25$)

$ TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT $ (estimated RTT plus a “safety margin”)
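A small Java sketch of the EWMA timeout computation above, with the typical α = 0.125 and β = 0.25; the seeding of the first sample follows common practice and the SampleRTT values are made up:

```java
// Sketch of the EWMA timeout estimation above (alpha = 0.125, beta = 0.25).
public class RttEstimator {
    private double estimatedRtt = 0.0;   // seconds
    private double devRtt = 0.0;         // seconds
    private boolean first = true;

    void onSample(double sampleRtt) {
        if (first) {                      // seed the estimator with the first sample
            estimatedRtt = sampleRtt;
            devRtt = sampleRtt / 2;
            first = false;
            return;
        }
        estimatedRtt = 0.875 * estimatedRtt + 0.125 * sampleRtt;
        devRtt = 0.75 * devRtt + 0.25 * Math.abs(sampleRtt - estimatedRtt);
    }

    double timeoutInterval() {            // EstimatedRTT plus a safety margin
        return estimatedRtt + 4 * devRtt;
    }

    public static void main(String[] args) {
        RttEstimator est = new RttEstimator();
        for (double s : new double[]{0.100, 0.120, 0.090, 0.300}) {  // made-up SampleRTTs
            est.onSample(s);
            System.out.printf("TimeoutInterval = %.3f s%n", est.timeoutInterval());
        }
    }
}
```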

TCP reliable data transfer(可靠数据传输)

  • TCP creates rdt service on top of IP’s unreliable service
    • pipelined segments
    • cumulative acks
    • single retransmission timer
  • retransmissions triggered by:
    • timeout events
    • duplicate acks
  • let’s initially consider simplified TCP sender:

  • ignore duplicate acks
  • ignore flow control, congestion control

TCP sender events:

  • data received from app:
    • create segment with seq #
    • seq # is byte-stream number of first data byte in segment
    • start timer if not already running
      • think of timer as for oldest unacked segment
      • expiration interval: TimeOutInterval
  • timeout:

    • retransmit segment that caused timeout
    • restart timer
  • ack received: if ack acknowledges previously unacked segments
    • update what is known to be ACKed
    • start timer if there are still unacked segments
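A hedged Java sketch of these three sender events with a single retransmission timer; all names and the stubs are illustrative, not TCP's actual code:

```java
// Sketch of the three simplified-sender events above: one retransmission timer
// for the oldest unACKed segment, restart on timeout, retransmit only that segment.
public class SimpleTcpSender {
    private long nextSeqNum = 0;      // byte-stream number of the next new segment
    private long sendBase = 0;        // oldest unACKed byte
    private boolean timerRunning = false;

    void onDataFromApp(byte[] data) {
        sendSegment(nextSeqNum, data);            // seq # = first byte of this segment
        if (!timerRunning) startTimer();          // timer covers the oldest unACKed segment
        nextSeqNum += data.length;
    }

    void onTimeout() {
        retransmitSegmentStartingAt(sendBase);    // resend segment that caused the timeout
        startTimer();                             // restart timer
    }

    void onAck(long ackNum) {
        if (ackNum > sendBase) {                  // ACKs previously unACKed data
            sendBase = ackNum;
            if (sendBase < nextSeqNum) startTimer();  // still unACKed segments in flight
            else stopTimer();
        }
    }

    private void sendSegment(long seq, byte[] data) { /* pass segment to IP (stub) */ }
    private void retransmitSegmentStartingAt(long seq) { /* retransmit (stub) */ }
    private void startTimer() { timerRunning = true; }
    private void stopTimer()  { timerRunning = false; }
}
```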

TCP sender (simplified)

TCP_sender.png

retransmission scenarios:

TCP_retransmission_scenarios.png

TCP_retransmission_scenarios1.png

TCP ACK generation [RFC 1122, RFC 2581]

event at receiver → TCP receiver action

  • arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed → delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK
  • arrival of in-order segment with expected seq #. One other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
  • arrival of out-of-order segment with higher-than-expected seq #. Gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
  • arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit(快速重传)

  • time-out period often relatively long: long delay before resending lost packet
  • detect lost segments via duplicate ACKs.
    • sender often sends many segments back-to-back
    • if segment is lost, there will likely be many duplicate ACKs.
  • if sender receives 3 dupl ACKs for same data(“triple duplicate ACKs”), resend unacked segment with smallest seq #

  • likely that unacked segment lost, so don’t wait for timeout

TCP_fast_retransmit.png
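A minimal Java sketch of the triple-duplicate-ACK rule; the counter logic is the idea above, the retransmission itself is a stub:

```java
// Sketch of the triple-duplicate-ACK rule above: count repeats of the same
// ACK number and retransmit the oldest unACKed segment on the third duplicate,
// without waiting for the timer.
public class FastRetransmit {
    private int lastAckNum = -1;
    private int dupCount = 0;

    void onAck(int ackNum) {
        if (ackNum == lastAckNum) {
            dupCount++;
            if (dupCount == 3) {                       // "triple duplicate ACK"
                retransmitSegmentStartingAt(ackNum);   // smallest unACKed seq # = ackNum
            }
        } else {                                       // new data ACKed
            lastAckNum = ackNum;
            dupCount = 0;
        }
    }

    private void retransmitSegmentStartingAt(int seq) {
        System.out.println("fast retransmit segment with seq " + seq);
    }
}
```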

TCP flow control(TCP流量控制)

receiver controls sender, so sender won’t overflow receiver’s buffer by transmitting too much, too fast

TCP_flow_control.png

  • receiver “advertises” free buffer space by including rwnd value in TCP header of receiver-to-sender segments
    • RcvBuffer size set via socket options (typical default is 4096 bytes)
    • many operating systems autoadjust RcvBuffer
  • sender limits amount of unacked (“in-flight”) data to receiver’s rwnd value
  • guarantees receive buffer will not overflow

TCP_flow_control_recvBuffer.png
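A minimal Java sketch of the sender-side flow-control rule above (in-flight data kept within the advertised rwnd); all names are illustrative:

```java
// Sketch of the flow-control rule above: the sender only transmits new data
// while the amount of unACKed ("in-flight") data stays within the receiver's
// advertised window rwnd.
public class FlowControlledSender {
    private long lastByteSent = 0;
    private long lastByteAcked = 0;
    private long rwnd = 4096;            // last receive window advertised by the receiver

    void onAck(long ackedUpTo, long advertisedWindow) {
        lastByteAcked = ackedUpTo;
        rwnd = advertisedWindow;         // rwnd rides in every receiver-to-sender segment
    }

    // how many more bytes may be sent right now without overflowing the receiver
    long sendableBytes() {
        long inFlight = lastByteSent - lastByteAcked;
        return Math.max(0, rwnd - inFlight);
    }

    void send(long nBytes) {
        long allowed = Math.min(nBytes, sendableBytes());
        lastByteSent += allowed;         // transmit 'allowed' bytes (transmission stubbed out)
    }
}
```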

Connection Management(TCP连接管理)

  • before exchanging data, sender/receiver “handshake”:
    • agree to establish connection (each knowing the other willing to establish connection)
    • agree on connection parameters

TCP_connection_management.png

  • Q: will 2-way handshake always work in network?
  • variable delays
  • retransmitted messages (e.g. req_conn(x)) due to message loss
  • message reordering
  • can’t “see” other side

2-way handshake failure scenarios:

TCP_connection_management1.png

TCP 3-way handshake(三次握手)

TCP_connection_management_3_way_handshake.png

TCP 3-way handshake: FSM

TCP_connection_management_3_way_handshake_FSM.png

TCP: closing a connection(四次挥手)

  • client, server each close their side of connection
    • send TCP segment with FIN bit = 1
  • respond to received FIN with ACK
    • on receiving FIN, ACK can be combined with own FIN
  • simultaneous FIN exchanges can be handled

TCP_connection_management_closing.png

Principles of congestion control(拥塞控制原理)

  • congestion:
  • informally: “too many sources sending too much data too fast for network to handle”
  • different from flow control!
  • manifestations:
    • lost packets (buffer overflow at routers)
    • long delays (queueing in router buffers)
  • a top-10 problem

Causes/costs of congestion: scenarios

  • two senders, two receivers
  • one router, infinite buffers
  • output link capacity: R
  • no retransmission

TCP_principle_congestion.png

TCP_principle_congestion1.png

  • one router, finite buffers
  • sender retransmission of timed-out packet
    • application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
    • transport-layer input includes retransmissions: $\lambda_{in}' \geq \lambda_{in}$

TCP_principle_congestion2.png

  • idealization: perfect knowledge
    • sender sends only when router buffers available

TCP_principle_congestion3.png

  • Idealization: known loss – packets can be lost, dropped at router due to full buffers
    • sender only resends if packet known to be lost

TCP_principle_congestion4.png

  • Realistic: duplicates
    • packets can be lost, dropped at router due to full buffers
    • sender times out prematurely, sending two copies, both of which are delivered

TCP_principle_congestion5.png

TCP_principle_congestion6.png

  • “costs” of congestion:
    • more work (retrans) for given “goodput”
    • unneeded retransmissions: link carries multiple copies of pkt
      • decreasing goodput
  • four senders

  • multihop paths
  • timeout/retransmit

  • Q: what happens as $\lambda_{in}$ and $\lambda_{in}'$ increase?

  • A: as red $\lambda_{in}'$ increases, all arriving blue pkts at upper queue are dropped, blue throughput $\to$ 0

TCP_principle_congestion7.png

TCP_principle_congestion8.png

  • another “cost” of congestion:
    • when packet dropped, any upstream transmission capacity used for that packet was wasted(上游路由器用于转发该分组而使用的传输容量最终被浪费掉了)

TCP congestion control: additive increase multiplicative decrease(AIMD, 加性增,乘性减)

  • approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
    • additive increase: increase cwnd by 1 MSS every RTT until loss detected
    • multiplicative decrease: cut cwnd in half after loss

TCP_congestion_aimd.png

TCP_congestion_cwnd.png

  • sender limits transmission: $ LastByteSent - LastByteAcked \leq cwnd $

  • cwnd(拥塞窗口长度) is dynamic, function of perceived network congestion

  • TCP sending rate:

    • roughly: send cwnd bytes, wait RTT for ACKS, then send more bytes
    • $ rate \approx \frac{cwnd}{RTT} bytes/sec $
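A toy Java sketch of AIMD and the rate approximation above; the MSS value and the loss pattern in main are made up purely for illustration:

```java
// Sketch of AIMD and the sending-rate approximation above:
// cwnd grows by one MSS per RTT, is halved on loss, and rate ~ cwnd / RTT.
public class AimdWindow {
    static final int MSS = 1460;         // bytes, illustrative
    private double cwnd = 10 * MSS;      // congestion window, bytes

    void onRttWithoutLoss() { cwnd += MSS; }      // additive increase: +1 MSS per RTT
    void onLoss()           { cwnd = cwnd / 2; }  // multiplicative decrease: halve cwnd

    double rateBytesPerSec(double rttSeconds) {   // rate ~ cwnd / RTT
        return cwnd / rttSeconds;
    }

    public static void main(String[] args) {
        AimdWindow w = new AimdWindow();
        for (int rtt = 1; rtt <= 20; rtt++) {
            if (rtt % 8 == 0) w.onLoss(); else w.onRttWithoutLoss();  // pretend a loss every 8 RTTs
            System.out.printf("RTT %2d: cwnd = %.0f bytes, rate ~ %.0f B/s%n",
                    rtt, w.cwnd, w.rateBytesPerSec(0.1));             // sawtooth pattern
        }
    }
}
```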

TCP Slow Start(慢启动)

  • when connection begins, increase rate exponentially until first loss event:
    • initially cwnd = 1 MSS
    • double cwnd every RTT
    • done by incrementing cwnd for every ACK received
  • summary: initial rate is slow but ramps up exponentially fast

detecting, reacting to loss

  • loss indicated by timeout:
    • cwnd set to 1 MSS;
    • window then grows exponentially (as in slow start) to threshold, then grows linearly (enters congestion avoidance, 拥塞避免)
  • loss indicated by 3 duplicate ACKs: TCP RENO (enters fast recovery, 快速恢复)

    • dup ACKs indicate network capable of delivering some segments
    • cwnd is cut in half, window then grows linearly
  • TCP Tahoe always sets cwnd to 1 MSS (on timeout or 3 duplicate ACKs) (enters slow start, 慢启动)

switching from slow start to CA (Congestion Avoidance)

TCP_congestion_switching.png

  • Q: when should the exponential increase switch to linear?
  • A: when cwnd gets to 1/2 of its value before timeout.

  • Implementation:

    • variable ssthresh
    • on loss event, ssthresh is set to 1/2 of cwnd just before loss event

TCP_congestion_FSM.png
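A simplified Java sketch of the cwnd evolution in the FSM above (Reno-style, ignoring fast-recovery window inflation); the initial ssthresh is an arbitrary illustrative value:

```java
// Sketch of the cwnd evolution described above (simplified TCP Reno):
// exponential growth below ssthresh, linear growth above it, and the two
// loss reactions. cwnd is counted in MSS units for clarity.
public class RenoCwnd {
    static final double MSS = 1.0;
    double cwnd = 1 * MSS;
    double ssthresh = 64 * MSS;          // initial value is implementation-dependent

    void onAck() {
        if (cwnd < ssthresh) cwnd += MSS;              // slow start: +1 MSS per ACK (doubles each RTT)
        else                 cwnd += MSS * MSS / cwnd; // congestion avoidance: ~+1 MSS per RTT
    }

    void onTripleDupAck() {              // Reno: halve and continue growing linearly
        ssthresh = cwnd / 2;
        cwnd = ssthresh;
    }

    void onTimeout() {                   // Reno and Tahoe: back to slow start
        ssthresh = cwnd / 2;
        cwnd = 1 * MSS;
    }
}
```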

TCP throughput(TCP吞吐量)

  • avg. TCP thruput as function of window size, RTT?
    • ignore slow start, assume always data to send
  • W: window size (measured in bytes) where loss occurs

    • avg. window size (# in-flight bytes) is 3/4 W
    • avg. thruput is 3/4W per RTT
    • $ \text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{RTT}\ \text{bytes/sec} $

TCP Futures: TCP over “long, fat pipes”(经高带宽路径的TCP)

  • example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
  • requires W = 83,333 in-flight segments
  • throughput in terms of segment loss probability, L [Mathis 1997]:
  • $ \text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \cdot \sqrt{L}} $
  • to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$ – a very small loss rate
  • new versions of TCP for high-speed
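A worked version of the two numbers above, under the stated assumptions (1500-byte segments, 100 ms RTT, 10 Gbps target):

$$
W \approx \text{throughput} \cdot \frac{RTT}{MSS} = \frac{10^{10}\ \text{bits/sec} \times 0.1\ \text{sec}}{1500 \times 8\ \text{bits}} \approx 83{,}333\ \text{segments}
$$

$$
\sqrt{L} = \frac{1.22 \cdot MSS}{RTT \cdot \text{throughput}} = \frac{1.22 \times 12{,}000\ \text{bits}}{0.1\ \text{sec} \times 10^{10}\ \text{bits/sec}} \approx 1.46 \times 10^{-5}
\quad\Rightarrow\quad L \approx 2 \times 10^{-10}
$$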

TCP Fairness(TCP公平性)

  • fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

  • Why is TCP fair

  • two competing sessions:
    • additive increase gives slope of 1, as throughput increases
    • multiplicative decrease decreases throughput proportionally

TCP_congestion_fair.png

Fairness and UDP

  • multimedia apps often do not use TCP
    • do not want rate throttled by congestion control
  • instead use UDP:
    • send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections

  • application can open multiple parallel connections between two hosts
  • web browsers do this
  • e.g., link of rate R with 9 existing connections:
    • new app asks for 1 TCP, gets rate R/10
    • new app asks for 11 TCPs, gets R/2

Explicit Congestion Notification (ECN)

  • network-assisted congestion control:
    • two bits in IP header (ToS field) marked by network router to indicate congestion
    • congestion indication carried to receiving host
    • receiver (seeing congestion indication in IP datagram), sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion